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Abstract 


We  discuss  a  group  of  parallel  algorithms,  and  their  implementa¬ 
tion,  for  solving  a  special  class  of  large  sparse  nonlinear  equations.  The 
type  of  sparsity  occurring  in  these  problems,  which  arise  in  VLSI  de¬ 
sign,  structural  engineering  and  many  other  areas,  is  called  a  block  bor¬ 
dered  structure.  We  describe  an  explicit  method  and  several  implicit 
methods  for  solving  block  bordered  nonlinear  problems,  and  make 
a  mathematical  analysis  and  computational  comparisons  of  the  two 
types  of  methods.  Several  variations  and  globally  convergent  modifi¬ 
cations  of  the  implicit  method  are  also  presented.  We  describe  parallel 
algorithms  for  solving  block  bordered  nonlinear  equations,  and  present 
experimental  results  on  the  Intel  hypercube  that  show  the  effective¬ 
ness  of  the  parallel  implicit  algorithms.  These  experiments  include  a 
fairly  large  circuit  simulation  that  leads  to  a  multi-level  block  bordered 
system  of  nonlinear  equations. 
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1  Introduction 

1.1  Definition  of  block  bordered  nonlinear  problems 

In  this  paper  we  present  a  group  of  parallel  algorithms,  and  their  implementa¬ 
tions,  for  solving  a  special  class  of  large  sparse  nonlinear  equations,  instances 
of  which  occur  in  VLSI  design,  structural  engineering  and  many  other  areas. 
The  class  of  sparsity  occurring  in  these  problems  is  called  a  block  bordered 
structure.  In  such  a  problem  the  general  system  of  n  nonlinear  equations  in  n 
unknowns  may  be  grouped  into  g  +  1  subvectors,  xx,  ...,  x,+x  and  /1,  ...,  fq+\ 
such  that  the  nonlinear  system  of  equations  has  the  form 

f  /,(*.,  3-fl+i)  =  0)  1  =  1)  •••>  9  -i\ 

1  /,+ i(*i,  ....  2:9+1)  =  0 

where 

x,  €  Rn', 

fi  €  Rn' ,  i  —  1,  •••,  9+1, 


9+1 

J2  n«  =  n- 

«=i 

The  block  bordered  Jacobian  matrix  of  (1.1)  is 


where 


Ai  =  €  Rn'*n\  i  =  1,  9, 

ax, 


(1.2) 
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Bi  =  4^~  e  Rn,Xn ,+1’  *  =  !>  -•  9’ 

OXq+i 

C,  =  €  Rn'+*xn<,  *  =  1,  ...,  9, 

p  _  21m  g  pn«+ixn«+i# 

5i9+i 

Newton’s  method  is  the  fundamental  approach  for  solving  a  general  non¬ 
linear  system  of  equations.  Several  parallel  algorithms  for  solving  general 
systems  of  nonlinear  equations  that  are  based  upon  Newton's  method  have 
been  developed  and  implemented  on  various  parallel  computers.  These  paral¬ 
lel  algorithms  consist  mainly  of  solving  the  linear  Jacobian  system  in  parallel. 
Many  parallel  algorithms  have  been  developed  for  solving  linear  systems,  such 
as  parallel  factorizations,  parallel  SOR  methods,  parallel  red-black  methods, 
parallel  multicolor  and  others  (see  e.g.  Ortega  and  Voigt  [1985],  O’Leary  and 
White  [1985],  White  [1986],  Fontecilla  [1987],  Coleman  and  Li  [1987]). 

In  the  case  of  large  sparse  nonlinear  problems,  one  cannot  expect  a  single 
parallel  algorithm  to  handle  all  the  instances  of  the  system  of  nonlinear  equa¬ 
tions  problem  efficiently,  but  rather  the  algorithm  must  take  into  account  the 
sparsity  structure  and  other  special  characteristics  of  the  problem.  Parallel 
algorithms  taking  advantage  of  this  special  structure  may  be  much  more  ef¬ 
ficient  than  algorithms  ignoring  the  structure.  This  paper  is  an  instance  of 
developing  special  algorithms  for  a  special,  important  structure. 

1.2  Background  on  block  bordered  problems 

Block  bordered  problems  of  the  form  (1.1)  arise  in  many  areas  of  science  and 
engineering,  and  a  few  algorithms  have  been  developed  to  solve  linear  block 
bordered  systems  of  equations  efficiently.  In  applications  such  as  structural 
engineering,  large  spatial  models  may  be  divided  into  q  regions  such  that  each 
region  only  interacts  directly  with  neighboring  regions.  The  variables,  x,-,  for 
each  region,  are  chosen  so  that  the  model  can  determine  their  values,  given 
the  values  of  the  linking  variables  (the  x?+i)  at  the  boundaries  of  the  regions. 
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The  linking  variables  are  tied  together  by  a  q  +  1st  set  of  equations  repre¬ 
senting  the  interactions  between  the  regions.  Thus  the  equilibrium  equations 
for  such  a  model  will  be  of  the  form  (1.1).  In  addition,  the  Jacobian  matrix 
is  symmetric.  These  problems,  and  parallel  algorithms  for  solving  the  linear 
block  bordered  systems  that  arise  from  them,  are  discussed  in  Faxhat  and 
Wilson  [1986],  Nour-Omid  and  Park  [1986]. 

Mu  and  Rice  [1989]  study  parallel  Gaussian  elimination  for  the  block  bor¬ 
dered  matrices  arising  from  the  discretization  of  partial  differential  equations 
(PDEs).  Christara  and  Houstis  [1988][1989]  implement  a  domain  decompo¬ 
sition  spline  collocation  method,  and  a  preconditioned  conjugate  gradient 
(PCG)  method,  for  this  linear  block  bordered  system  on  both  NCUBE/7 
and  Sequent  multiprocessors. 

All  the  work  described  above  concerns  parallel  methods  for  solving  lin¬ 
ear  block  bordered  equations.  Our  research  is  to  develop,  implement,  and 
analyze  parallel  methods  for  solving  nonlinear  block  bordered  problems.  To 
our  knowledge,  no  one  has  done  similar  work.  Chan[1985]  considers  methods 
for  a  class  of  nonlinear  problems  that  includes  (1.1),  but  his  focus  is  on  ap¬ 
proximate  solutions  of  the  linearizations  of  these  equations,  and  he  does  not 
specifically  consider  block  bordered  structure. 

Block  bordered  equations  also  arise  in  VLSI  circuit  design,  where  parts 
of  the  circuits  may  be  divided  into  regions.  The  concept  of  macromodeling 
the  circuit  is  to  decompose  the  circuit  into  subcircuits  and  to  analyze  the 
subcircuits  separately  (see  Rabbat  et  al  [1979],  [1980]).  Macromodeling  of 
the  circuit  results  in  a  system  of  nonlinear  equations  of  form  (1.1).  Ax  and 
Bi  (i  =  1,  ...,  q)  in  the  Jacobian  matrix  are  usually  used  to  represent 

internal  and  input-output  variables  in  each  of  the  q  independent  subcircuits. 
The  variables  represent  voltages  and  currents.  The  bottom  block  fq+\  rep¬ 
resents  the  voltages  or  currents  between  subcircuits.  Since  each  voltage  or 
current  is  used  only  in  one  block  of  equations  /,  plus  possibly  the  bottom 
block  /,+i,  the  nonzero  columns  of  the  Bi' s  (and  A,)  are  disjoint,  meaning 
that  the  matrices  2?,,  ...  Bq  in  (1.2)  also  follow  a  block  diagonal  pattern.  In 
addition,  since  /9+i  describes  the  point-to-point  connections  of  voltage  and 
current,  it  is  a  linear  function.  The  size  of  the  function  fq+ 1  depends  on  the 
number  of  connections  among  the  subcircuits. 
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We  have  studied  parallel  methods  for  solving  block  bordered  nonlinear 
equations  extensively  from  both  theoretical  and  practical  viewpoints.  Sec¬ 
tion  2  presents  an  explicit  method  and  severed  implicit  methods  for  solving 
block  bordered  nonlinear  equations,  and  gives  some  mathematical  analysis 
and  computational  comparison  of  these  two  types  of  methods.  Section  3 
briefly  discusses  techniques  used  to  make  these  methods  globally  convergent. 
In  Section  4,  we  give  a  group  of  parallel  algorithms  for  solving  block  bor¬ 
dered  nonlinear  systems  of  form  (1.1)  which  may  be  implemented  on  both 
shared  and  distributed  memory  multiprocessors.  The  implementations  and 
experimental  results  of  these  algorithms  on  the  Intel  hypercube,  a  distributed 
memory  multiprocessor,  are  presented  in  Section  5.  Finally,  our  conclusions 
and  some  future  research  directions  are  summarized  in  Section  6. 


2  Explicit  and  Implicit  Methods 

2.1  Introduction 

There  are  two  basic  ways  in  which  Newton’s  method  can  be  applied  to  the 
nonlinear  block  bordered  system  of  equations  (1.1).  The  explicit  approach  is 
to  simply  apply  Newton’s  method  to  (1.1).  This  involves  iteratively  solving 
the  linear  system 

J(Xk)AXk  =  - F(Xk )  i  =  1,  ...,  q  (2.1) 

for  AXk ,  where  J(Xk)  is  the  Jacobian  of  F,  which  has  the  block  bordered 
structure  (1.2). 

The  pure  implicit  approach  is  to  use  each  of  the  q  systems  of  nonlinear 
equations 

/•(^ii  )  =  0,  i  =  1,  ....  q  (2.2) 

to  solve  for  x,,  given  a  fixed  value  of  xg+1.  This  means  that  each  of  the  x, 
is  implicitly  given  by  a  function  of  xq+i.  The  whole  problem  (2.2)  is  then 
equivalent  to  solving 

fq  +  \  (^"l  ("^9+1  )i  —t  Zq(Xq+l  ),  Xg  +  ]  )  =  0.  (2-3) 
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The  Jacobian  of  this  system  is  given  by 

j  _  df„+ 1  _  dfq+i  .  df, .  _t  dfi 
9xq+i  dxi  5x9+1 


I  =  1,  q 


(2.4) 


J  =  P  -  Y^CiAr'Bi  i  =  1,  q  (2.5) 

i=i 

and  we  may  solve  (2.3)  by  Newton’s  method.  We  will  be  considering  practical 
variants  of  the  implicit  method  that  do  not  calculate  x,(x?+1)  exactly. 


In  this  section  we  describe  the  explicit  and  implicit  methods  and  their 
relations  to  each  other,  and  give  some  simple  experimental  results  using  them 
on  a  sequential  computer.  These  results  give  a  preliminary  indication  of  why 
the  implicit  method  will  be  advantageous  on  parallel  computers. 


2.2  Algorithms  and  analysis  for  the  explicit  and  im¬ 
plicit  methods 

Newton’s  method  applied  to  (1.1)  in  the  explicit  method  consists  of  solving 
the  following  linear  equations  at  iteration  k  (k  —  0,  1,  ...):  from  /,(x)  =  0, 
i  1 ,  ... ,  q , 

,4, Ax*  +  BiAxkq+1  +  /,(**,  xj+1)  =  0  (2.6) 

and  from /,+i(xf,  ...,  x£,x£+1)  =  0 

Ax*  +  P Ax*+1  +  /,+1(x*,  ...,  x,*,  xj+1)  =  0.  (2.7) 

t=i 

(For  simplicity  we  are  omitting  superscripts  k  on  Ai,Bi,Ci  and  P,  but  note 
that  at  least  the  -4,’s  and  J9,’s  can  change  at  each  iteration.)  Solving  for  Ax* 
in  (2.6)  and  substituting  into  (2.7),  we  obtain 

J AxJ+l  =  -/,+,(x{ .  xj,  x$+1)  +  £c..4r7.(*‘,  x}+1)  (2.8) 

1=1 

where  J  is  given  by  (2.5).  (We  assume  for  now  that  J  and  each  .4,  is  non- 
singular.)  So 

=  *;+,  +  a*;+1  (2.9) 
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(2.10) 


can  be  determined  from  (2.8),  and 

4"  =  4  +  a*?  i  =  1, ....  q 

can  be  determined  from  (2.6). 

In  the  pure  implicit  method,  Newton's  method  is  applied  to  (2.3),  and 
gives 

J^Xq+ 1  "h  fq+l{Xl{Xq+l)l  •••!  ^(Sg+l)*  ^-fl)  =  0  (2-11) 

where  x,(x£+1)  ( i  =  1,  ...,  q )  is  implicitly  determined  by  solving  the 
nonlinear  system 

/.(*,-,  *J+1)  =  0  (2.12) 

for  x,-.  To  turn  this  into  a  practical  computational  procedure,  we  use  a  second 
(or  inner)  iterative  Newton  process  on  (2.12)  to  calculate  an  x,(x£+1)  which 
solves  (2.12)  approximately.  For  each  i  =  1,  ...,  q,  this  yields  the  inner 
iterations 

AiAx^-1  +  xj+1)  =  0  i  =  1,  ..,  q,  j  =  1,  2,  (2.13) 

Here  x*'°  =  x*,  is  the  number  of  inner  iterations,  and  Ai  =  Ai  if  it  is 
only  evaluated  once  at  the  beginning  of  each  outer  iteration,  else  it  may  be 
evaluated  up  to  7m  times.  At  the  end  of  each  inner  iteration,  we  set 

xi’3  =  x,-J_1  +  Axf’-7-1,  i  =  1,  ...,  q.  (2.14) 

When  we  exit  the  inner  iterations,  we  set 

*rf4+1)  =  4”  =  x».  (2.15) 

Then,  x,+1  is  determined  from  (2.11),  and 

¥i  =  *$+1  +  AxJ+1.  (2-16) 

The  following  theorems  show  that  the  explicit  method  and  the  implicit 
methods  just  described  are  very  closely  related. 

Theorem  1  If  fq+\  is  linear  and  only  one  inner  Newton  iteration  is  applied 
to  solve  for  xf+1  (i  =  1,  ...,  q)  in  the  implicit  method,  i.e.  /,„  =  1,  then 
for  a  fixed  k,  the  steps  Ax*+1  axe  identical  in  both  methods. 
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Proof  In  this  case  x*+1  =  x*  -  A,  1  }i{x^  *{+i),  and  /,+1  (x^+1,  ...  ,  x£+1, z}+1) 
=  /«+i(*i>  ...  *}»  *J+1)  -  n=i  CtA-lf(xlxk^  ).  Substituting  this  into 
(2.11)  gives 

jAiJ+1  =  -/,+,(*{ .  xj,  x{+1)  +  xj+l)  (2.17) 

1=0 

which  is  identical  to  the  explicit  formula  (2.8).  □ 

In  the  implicit  method  described  above,  the  steps  Ax,-  for  i  <  q  do  not 
involve  any  information  about  fq+i  °r  Ax?+j,  and  cure  not  the  same  as  in  the 
explicit  method.  For  this  reason,  the  method  is  not  one-step  quadratically 
convergent.  If  the  value  of  x*+1(z  <  q)  calculated  by  the  implicit  method  is 
corrected  after  each  iteration  to  account  for  the  change  in  x9+i,  however,  the 
implicit  method  can  be  made  closer  to  the  explicit  method  and  quadratically 
convergent.  The  problem  may  be  defined  to  find  a  correction  term  6  such 
that 

/.(x?+1  +  S ,  xJ+{)  *  0  (2.18) 

or 

/,(xf+1  +  8,  xj+1  +  AxJ+1)  «  0.  (2.19) 

Making  a  linear  approximation  to  /,  in  (2.19)  yields  the  condition 

/.(*f+\  *J+1)  +  A,8  +  Bi Ax£+1  =  0.  (2.20) 

The  correction  term  8  obtained  from  (2.20)  would  then  be 

6  =  xj+1)  +  B, Ax^J.  (2.21) 

However,  after  /,„  inner  iterations  of  solving  for  xf+1,  /,(x,  ,  r(+])  ss  0. 

Thus  we  maJke  a  further  approximation  giving  the  correction  term 

<5,-  =  -AT'Bi  AxJ+1.  (2.22) 

We  name  the  implicit  method  with  this  correction  term  added  to  each  x*+1 
after  Ax*+1  is  calculated  the  corrected  implicit  method.  The  cost  of  the 
correction  term  (2.22)  is  small  since  the  matrices  A~lB{  ( i  =  1,  ...,  q )  have 
been  calculated  already  and  in  a  parallel  implementation,  the  matrix-vector 
products  can  be  parallelized  fully. 
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We  can  now  see  that  when  /9+1  is  linear  the  explicit  method  is  a  special 
case  f  the  corrected  implicit  method. 

Theorem  2  If  /9+1  is  linear  and  only  one  Newton  iteration  is  applied  to 
solve  for  ( i  =  1,  ...,  q)  in  the  implicit  method,  i.e.  /,n  =  1,  and 
the  system  is  corrected  by  adding  —  Ax?+i  to  xf+1(i  <  q)  after  each 
iteration,  then  the  explicit  method  and  implicit  method  are  identical. 

Proof  From  Theorem  1,  Ax£+1  is  identical  for  the  two  methods.  Combining 
(2.14)  with  j  =  1  and  (2.22)  gives: 

*?+1  ---  xf  -  A-'[y;(x‘,  *{*>>  -  B,Ax;+1]  (2.23) 

which  is  identical  to  (2.6)  in  the  explicit  method.  □ 

Corollary  If  /,+j  is  linear,  the  corrected  implicit  method  with  =  1 

inner  iterations  per  outer  iteration  is  locally  quadratically  convergent  to  the 
solution. 

Quadratic  convergence  can  also  be  shown  for  >  1  and  when  /9+1  is 
nonlinear.  The  proof  is  given  in  Zhang[l989]. 


2.3  Some  experiments  on  a  sequential  processor 

The  previous  subsection  shows  that  a  variant  of  the  implicit  method  is  equiv¬ 
alent  to  the  explicit  method,  but  doesn't  indicate  why  the  implicit  method 
might  be  preferred.  The  main  reason  turns  out  to  be  that,  by  using  more  than 
one  inner  iteration  per  outer  iteration  in  the  implicit  method,  the  number 
of  outer  iterations  can  be  reduced  substantially,  which  is  advantageous  espe¬ 
cially  for  parallel  computation.  In  this  subsection  we  give  a  first  indication 
of  the  sort  of  computational  behavior  that  we  hi  ve  found. 

We  initially  tested  the  methods  discussed  in  this  section  on  several  arti¬ 
ficial  problems.  Here  we  report  results  on  a  simple  20  x  20  nonlinear  block 
bordered  system  of  equations  which  has  four  4x4  blocks,  Ai,  ...  ,  A4,  and 
a  4  x  4  bottom  block  P,  and  fq+\  linear.  In  all  cases,  the  starting  value  of  1 
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was  close  to  the  solution,  and  no  global  strategy  (e.g.  line  search)  was  used. 
All  these  experiments  were  run  on  Pyramid  P90  computer.  The  equations 
are  given  in  Zhang[1989|. 

First,  we  compare  the  performance  of  the  three  methods  when  only  one 
inner  iteration  (/,•„  =  1)  is  used  in  the  uncorrected  implicit  and  corrected 
implicit  methods  (Table  2.1).  The  explicit  method  and  the  corrected  implicit 
method  with  7,n  =  1  are  identical  in  this  case  (see  Theorem  2).  Thus,  the 
same  number  of  iterations  are  required  to  converge  to  the  solutions.  The 
computing  times  are  slightly  different  since  the  implementations  of  the  two 
methods  are  different.  The  uncorrected  implicit  method  converges  a  little  bit 
slower  than  the  other  two  methods,  which  is  reasonable  since  the  correction 
step  is  needed  to  make  it  one-step  quadratically  convergent. 


Table  2.1:  Experiments  with  the  Three  Methods 


(I in  =  1)  outer  iterations  (seconds) 

explicit 

implicit 

corrected  implicit 

13  (0.44) 

14  (0.40) 

13  (0.4  j) 

Next  we  increased  the  number  of  inner  iterations  in  the  uncorrected  and 
corrected  implicit  methods.  The  experimental  results  (Tables  2.2  and  2.3) 
show  that  the  number  of  outer  iterations  is  sharply  decreased  when  the  num¬ 
ber  of  inner  iterations  is  2.  However,  the  number  of  outer  iterations  decreases 
more  slowly  as  increases  further.  There  exists  an  optimal  value  of  /,„  for 
computing  time  in  both  methods,  but  it  is  problem  dependent.  Our  exper¬ 
iments  also  show  that  the  corrected  implicit  method  converges  a  little  bit 
faster  than  the  uncorrected  implicit  method  when  /,„  >  1,  which  is  consis¬ 
tent  with  our  convergence  analysis.  In  Section  5  we  will  see  that  for  larger 
problems,  the  improvements  in  time  for  the  corrected  implicit  method  with 
I in  >  1  can  be  considerably  larger  than  those  seen  here.  Also,  we  will  see 
in  Sections  4  and  5  that  the  decrease  in  outer  iterations  is  advantageous  for 
parallel  computation. 
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Table  2.2:  Experiments  with  the  (Uncorrected)  Implicit  Method 


(/in 

>  1)  outer  iterations  (seconds) 

/in  =  1 

IflU 

ESI 

KSB9 

14  (0.40) 

8  (0.34) 

7  (0.40) 

6  (0.44) 

6  (0.54) 

Table  2.3:  Experiments  with  the  Corrected  Implicit  Method 


(/in 

>  1)  outer  iterations  (seconds) 

/ in  =  1 

ESB1 

mi 

■/■LgM 

169^^3 

13  (0.40) 

8  (0.38) 

6  (0.36) 

6  (0.50) 

5  (0.54) 

3  Globally  Convergent  Modifications  of  the 
Explicit  and  Implicit  Methods 


The  explicit  method  and  corrected  implicit  method  are  locally  quadratically 
convergent  to  the  solution  of  (1.1).  In  other  words,  when  the  initial  solution 
approximation  is  good  enough,  these  methods  are  guaranteed  to  converge 
rapidly  to  a  root  of  (1.1).  However,  it  is  often  hard  to  find  a  good  initial 
approximation  for  nonlinear  systems  of  equations  in  practice.  In  addition, 
many  practical  problems,  such  as  the  circuit  equations,  are  highly  nonlinear, 
and  if  the  current  solution  approximation  is  not  close  enough,  a  Newton  step 
may  easily  result  in  an  increase  in  the  function  norm.  For  example,  a  small 
change  in  some  voltage  difference  in  a  nonlinear  circuit  equation  may  result 
in  a  great  change  in  an  exponential  term  in  a  diode  or  transistor’s  function 
evaluation.  Also,  many  block  bordered  equations  result  in  singular  or  nearly 
singular  Jacobians  in  the  process  of  the  iterations,  for  example  because  a 
transistor  with  an  exponential  model  is  turned  on  at  a  nearly  flat  function 
curve  (see  Zhang,  Byrd,  Schnabel  [1989]). 
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For  these  reasons,  the  explicit  and  implicit  methods  need  to  be  modified 
to  handle  unacceptable  steps  and  singular  Jacobian  matrices  in  order  to 
converge  to  a  solution  from  poor  starting  points.  In  this  section  we  briefly 
describe  the  modifications  we  have  used,  which  are  motivated  in  part  by  their 
appropriateness  for  parallel,  distributed  computation.  They  are  described  in 
more  detail  in  Zhang  [1989].  A  global  convergence  analysis  will  be  given  in 
a  forthcoming  paper. 


We  achieve  convergence  from  poor  starting  points  by  using  a  line  search. 
The  explicit  method  is  just  a  standard  Newton’s  method,  so  we  can  use  a 
standard  line  search.  That  is,  we  calculate  the  overall  step  direction  dk  = 
(Ax i . . .  Axk,  Ax£+1)  as  described  in  Section  2.2,  and  then  set 

Xk+i  _  Xk  +  Xkdk 

where  the  steplength  parameter  Xk  >  0  is  chosen  by  a  line  search  procedure 
that  assures  sufficient  descent  on  (|  F(x)  ||2.  Our  line  search  is  based  upon 
Algorithm  A6.3.1  in  Dennis  and  Schnabel  [1983]. 


The  implicit  method  is  more  complicated  since  we  have  both  inner  and 
outer  iterations.  We  need  to  choose  the  steps  Axf’J  =  (x,fc’-7+1  —  xfJ)  in  the 
inner  iterations  so  that  the  overall  step  direction 


d”  =  (E'LV  Axf'j  +  if,  +  if,  Axf+1), 


(3.1) 


where  6k  is  the  correction  step  (2.22),  is  a  descent  direction  on  ||  F(x)  ||2.  In 
addition,  we  would  like  the  calculation  of  the  steps  AxfJ  for  different  values 
of  i  to  be  independent,  so  that  the  calculations  of  the  inner  iterations  can  be 
parallelized  easily  and  efficiently. 


Zhang  [1989]  shows  that  if  each  Aij'’-7  is  calculated  by 

AxF  =  -A  (3.2) 

(i.e.  the  step  discussed  in  Section  2.2  multiplied  by  a  line  search  parameter 
A,*1,7  >  0),  the  corrections  steps  Sk  are  calculated  by  (2.22),  Ax*+1  is  calculated 
by  (2.11)  as  before,  and  fq+i  is  linear,  then  dk  given  by  (3.1)  satisfies 
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](xk)dk  = 
(S&T'A 


• ,  sfsr1  *j+l  ),/,+i  (*{, 


From  this,  it  is  straightforward  to  show  that  a  sufficient  condition  for  dk  given 
by  (3.1)  to  be  a  descent  direction  on  ||  /( x)  j|2  is  that  for  each  i  =  1 

ZjSo'WM*?, *Ui)  <  0.  (3.3) 

Note  that  (3.3)  is  always  true  for  Ii„  =  1  (since  each  Af’°  >  0),  and  that  it  is 
true  for  =  2  if  ||  /(xf’1,  x£+1)  ||<||  /(xf’°,x*+1)  j|  (which  any  line  search 
will  enforce)  and  Af’1  <  A*’0.  Since  (3.3)  can  be  monitored  independently 
for  each  i,  the  following  strategy  will  guarantee  that  a  descent  direction  is 
generated.  For  each  j,  the  procedure  calculates  each  Ax*’-7  by  (3.2)  using 
a  standard  line  search  as  mentioned  above,  and  then  checks  whether  the 
corresponding  partial  sum  of  (3.3)  is  satisfied.  If  it  is  not,  it  sets  Ax1!'1  =  0 
for  /  =  j, . . . ,  7in  —  1  and  exits  the  inner  iteration  for  x,.  The  outer  line  search 
can  be  performed  as  in  the  explicit  method. 


Our  approach  for  dealing  with  (nearly)  singular  Jacobians  is  based  upon 
the  Levenberg-Marquardt  approach  as  described  in  Dennis  and  Schnabel 
[1983].  For  a  general  system  of  nonlinear  equations,  if  the  current  Jaco¬ 
bian  matix  J  is  (nearly)  singular,  this  approach  modifies  the  search  direction 
to  be  —(JTJ  +  /i/)-1  JTF,  where  F  is  the  current  function  value,  and  p  is 
a  small  positive  number.  This  direction  is  a  descent  direction  on  ||  F(x)  ||2 
and  is  the  solution  to  the  trust  region  problem 

minimize  y  f  +  Jd  |j2  subject  to  y  d  y2  <  A 

for  some  A  >  0.  In  the  limit  as  \i  — ►  0,  this  direction  equals  —  J+F,  where 
J+  is  the  pseudoinverse  of  J. 


In  the  explicit  and  implicit  methods  described  in  Section  2,  we  need 
to  solve  systems  of  linear  equations  using  the  matrices  and  the 

matrix  J  =  P  —  Y,qi=lCiAjl Bi.  If  any  of  these  matrices,  say  M,  is  nearly 
singular  (i.e.,  either  the  factorization  detects  numerical  singularity  or  the 
estimated  condition  number  of  M  is  greater  than  macheps -1/2)  we  simply 
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replace  M-1  by  ( MT M  -f  nl)~xMT  in  the  formulas  as  of  Section  2,  where  /x 
is  chosen  by  the  strategy  suggested  by  Dennis  and  Schnabel  [1983]  and  is  a 
function  of  M.  These  perturbations  again  have  interpretations  in  terms  of 
trust  regions.  Note  also  that  the  algorithms  for  deciding  whether  to  perturb 
each  A\, . . . ,  Aq,  and  for  perturbing  them  if  necessary,  are  totally  independent 
so  that  they  can  be  performed  in  parallel. 

Combining  these  perturbation  techniques  with  the  inner  line  searches  in 
a  way  that  assures  descent  at  the  outer  iteration  and  global  convergence  is 
somewhat  more  complex,  and  will  be  addressed  in  Byrd,  Feng,  Schnabel, 
and  Zhang  [1990].  In  our  implementations,  we  have  simply  taken  inner 
iterations  for  each  block  i,i  =  1, . . . ,  q.  We  have  used  a  standard  line  search 
to  choose  each  A*’'7  (requiring  sufficient  descent  on  /,)  but  have  not  checked 
a  condition  like  (3.3)  that  assures  global  descent.  To  our  knowledge,  the 
algorithm  has  still  always  produced  a  descent  direction. 

The  algorithm  we  implement  is  summarized  below.  If  J  or  any  A,  below 
is  (nearly)  singuiar,  J-1  or  A~l  is  replaced  by  ( jTJ  -f  nI)~ljT  or  {Af  A,  + 
HiI)~lAj  for  a  small  positive  /x  or  /x,,  respectively.  When  /9+1  is  linear,  the 
explicit  method  is  just  a  special  case  with  /,„  =  1  and  each  Af’1  =  1. 

Implicit  Method  with  Global  Modification 

1.  For  j  =  0, . . .  I in  —  1,  calculate  xf'',+1  =  x*’-7  —  A*’-,.4,-1/(x,fc,,?  ,  x£+1)  where 

>  0,*  =  1,...,?. 

2.  Form  and  factor  J  =  P  —  E’=1C,.4,-1  £?,. 

3.  Calculate  Ax£+1  =  -j~lfq+ j(xi+1, . . . ,  x£+1,  x*+1)  where  x*+1  =  xf'/,n. 

4.  Calculate  the  corrections  6*  =  —  Af1  BiAx*+l  and  set  xf+l  =  xf+1+6f ,  i  = 

1, 

5.  Calculate  Xk+1  =  Xk  —  \kdk  where  dk  =  (xj+1  — xf, ...  ,x£+1  — xj,  Ax£+1). 
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4  Parallel  Explicit  and  Implicit  Algorithms 


4.1  Motivation  -  LU  factorization  of  block  bordered 
linear  equations 

Note  that  the  LU  factorization  of  the  block  bordered  Jacobian  matrix 


is 


l  u 


L2 


(M 

I  Aa 


M 

Ba 


A,  Bq\ 

\C,  C2  .  .  Cq  P  ) 


\  (  ux 


U 2 


A  ) 

b2 


\  Cl  c2  .  .  c\  Lq+1  )  [ 

where  for  i  —  1,  q, 

At  =  LXU, 

Bi  =  L~'Bt 

c,  =  Qur1 

and 

Lq+iUq+1  =  J  =  P  -  ^2  C(A~lBi  =  P 


Uq  Bq 

tW  ) 


A  A 

(This  is  the  same  matrix  J  as  in  Section  2.)  The  calculations  of  Z,,  B,, 

and  Ci  for  each  i  are  independent,  and  thus  can  be  computed  very  efficiently 
in  parallel.  The  factorization  of  J  must  follow  these  calculations  and  will  not 
parallelize  as  efficiently,  especially  on  distributed  memory  multiprocessors, 
because  it  will  require  considerable  communication. 
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A  parallel  version  of  the  explicit  method  essentially  consists  of  performing 
the  above  factorization  in  parallel  at  each  iteration.  The  parallel  version  of 
the  implicit  method  that  we  discuss  next  will  be  seen  to  perform  closely 
related  operations.  The  major  difference  will  be  that,  by  performing  more 
than  one  inner  iteration  per  outer  iteration,  it  will  spend  a  larger  portion  of 
its  time  on  the  calculations  that  parallelize  very  efficiently,  those  for  blocks 
1,  ...,  q ,  and  a  smaller  portion  of  its  time  on  the  calculations  that  parallelize 
less  well,  the  formation  and  factorization  of  J  and  the  outer  line  search. 
Thus  the  implicit  method  can  be  expected  to  parallelize  more  effectively 
than  the  explicit  method,  especially  on  distributed  memory  computers.  If 
the  two  methods  require  similar  amounts  of  time  on  sequential  computers, 
as  indicated  in  Section  2,  then  the  implicit  method  can  be  expected  to  be 
faster  on  parallel  computers. 

4.2  Parallel  Algorithms 

Below  we  give  a  general  description  of  a  parallel  corrected  implicit  method 
that  is  based  upon  the  sequential  method  presented  in  Sections  2  and  3. 
The  parallelism  comes  mainly  from  executing  all  the  operations  on  blocks  1 
through  q,  which  have  been  designed  to  be  independent,  concurrently.  The 
parallel  explicit  method  is  just  the  special  case  with  =  1  and  no  inner 
line  search. 

Inner  Iterations 

1.  For  i  =  1,  <7,  Do  in  parallel: 

1.1  Factor  A,  and  estimate  its  condition  number  Cond(Ai) 

1.2  If  Cond(Ai)  <  Tol  then  set  Mi  =  A,-,  N,  =  I 

Else  choose  >  0,  form  and  factor  A/,  =  AfA,-  +  /jtI, 
set  Ni  =  Aj 

1.3  For  j  =  0,  I in  —  1,  Do: 

Solve  Af.Axf’7  =  — Ar,/,(xlfc,7,x*+1)  for  Axf'7 

Inner  line  search:  xf’7+1  =  xf'7  -f  A f’7Axf’7  for  some  A *’7  >  0 

1.4  Solve  MiWi  =  NiBi  for  IV, 

1.5  Calculate  7i  =  C,IV, 


1G 


Outer  Iteration 


*2.  Form  J  =  P  -  ZUi  T> 

*3.  Factor  J  and  estimate  its  condition  number  Cond(j) 

*4.  If  Cond(J)  <  Tol  then  set  M  =  J ,  N  =  / 

Else  choose  \i  >  0,  form  and  factor  M  =  JTJ  4-  /i/,  set  A:  =  Jr 

*5.  Solve  MAxk+1  =  -N  fq+1{xk'I,n ,  ...,  x£-;*",x£+1)  for  Ax}+1. 

6.  Fori  =  1,9,  Do  in  parallel: 

Calculate  corrections  6k  =  —  iy,  Ai*+1  and  set  x*+1  =  xk'Iin  +  6* 

*7.  Outer  line  search:  xk+1  =  xk  4-  Afc(xJ+1  —  xf,  xk+1  —  xk,Axk+1) 
for  some  \k  >  0. 


The  steps  marked  with  stars  requires  synchronization  (on  a  shared  mem¬ 
ory  multiprocessor)  or  communication  (on  a  distributed  memory  multiproces¬ 
sor).  Step  2  requires  synchronization  if  the  matrices  T,  are  full.  In  the  VLSI 
problems,  however,  the  nonzero  columns  of  Bi,  and  hence  T),  are  disjoint  (see 
Section  1.2)  and  hence  step  2  can  be  performed  efficiently  in  parallel. 

On  shared  memory  machines,  steps  3-5  can  be  performed  in  parallel  using 
standard  parallel  methods  for  solving  linear  equations.  On  a  distributed 
memory  machine,  it  will  only  be  efficient  to  perform  steps  3-5  in  parallel  if 
the  dimension  of  J  is  rather  large.  In  our  test  problems,  J  was  fairly  small, 
so  we  performed  steps  3-5  on  one  processor,  on  which  we  kept  P,  J,  and 
x,+i.  The  remaining  data  was  distributed  in  the  obvious  way:  /!,,  5,,  C,, 
and  x,-  were  stored  together  on  the  processor  that  handled  block  i.  Step 
7  includes  two  main  operations,  the  calculation  of  trial  points  xk+1  and  the 
evaluations  of  F  at  the  points,  that  are  performed  in  parallel  on  a  shared 
memory  machine,  and  may  be  performed  sequentially  or  in  parallel  on  a 
distributed  memory  machine  depending  on  their  costs  relative  to  the  cost  of 
communication. 
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5 


Experimental  Results  on  a  Hypercube  Mul¬ 
tiprocessor 

5.1  The  test  problem:  a  nonlinear  block  bordered 
circuit  equation 

The  nonlinear  block  bordered  application  we  considered  for  testing  our  meth¬ 
ods  is  the  VLSI  circuit  simulation  problem.  Standard  circuit  simulation 
methods  consist  of  stiffly  stable  implicit  integration  formulae  to  discretize 
the  differential  equations,  Newton’s  method  to  solve  the  resulting  nonlinear 
algebraic  equations, 

F(  X)  =  0,  (5.1) 

and  sparse  LU  decomposition  to  solve  the  linear  equations  that  arise  at  each 
iteration  of  Newton’s  method, 

JAX  =  -F(X),  (5.2) 

where  J  6  RnXn  is  the  Jacobian  matrix  of  (5.1).  Typically,  less  than  2 
percent  of  the  entries  of  J  are  nonzero  for  n  >  500  (see  e.g.  Sangiovanni- 
Vincentelli  and  Webber  [1986]).  The  Newton  iteration  for  (5.1)  is  repeated 
until  a  root  is  found  or  the  upper  bound  on  the  number  of  iterations  is 
reached.  The  program  then  decides  whether  to  accept  the  solution,  based  on 
its  estimate  of  local  truncation  error  and  the  number  of  iterations  required. 

As  mentioned  in  Section  1.2,  partitioning  the  circuit  leads  to  a  block  bor¬ 
dered  system  of  nonlinear  equations  of  the  form  (1.1)  (see  e.g.  Rabbat  et 
al  [1979]).  Given  a  circuit  network  T,  a  group  of  partitioned  subnetworks 
7,,  i  =  1,  ...,  q,  and  the  connecting  current  and  voltage  equations,  the 

block  bordered  nonlinear  system  of  equations  is  defined  as  follows.  Currents 
between  two  subnetworks  and  voltages  at  the  boundary  are  each  represented 
by  two  variables,  one  in  each  subnetwork,  which  are  set  equal  to  each  other 
by  the  equations  of  /,+1.  Variables  Xi  (i  =  1,  ...,  q )  are  used  to  rep¬ 

resent  internal  voltage  and  current  variables  in  each  of  the  q  independent 
subnetworks.  They  also  include  the  current  connecting  variables  among  the 
q  subnetworks.  The  variable  x,+i  is  used  to  represent  the  voltage  connect¬ 
ing  variables  among  the  q  subnetworks.  Equations  /,  represent  the  current 
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equations  for  subnetwork  7,-  and  its  connecting  variables.  These  equations 
for  voltages  and  currents  are  standard  current  equations  involving  resistors, 
transistors,  diodes,  voltage  sources  and  other  elements.  Since  the  connecting 
equation  /,+ 1  is  linear,  the  coefficient  matrices  C{,  i  =  1,  q  for  the 

current  connecting  functions  are  constant,  and  the  coefficient  matrix  P  for 
the  voltage  connecting  function  is  also  constant. 

For  a  very  large  circuit,  the  network  T  may  be  divided  into  subnetworks 
recursively,  which  leads  to  a  multi-level  block  bordered  system  of  nonlinear 
equations.  In  such  a  case,  the  diagonal  blocks  Ai,  ( i  =  1,  ...,  q)  are 

themselves  block  bordered  matrices.  The  border  elements  of  the  multilevel 
system  represent  the  connections  of  the  highest  level. 

We  first  applied  our  algorithm  to  the  circuit  shown  in  the  figure  in  Ap¬ 
pendix  A.  It  is  the  741  op-amp  circuit  (see  e.g.  Sedra  and  Smith  [1982]), 
•which  was  introduced  in  1966  and  is  currently  produced  by  almost  every 
analog  semiconductor  manufacturer.  The  circuit  is  partitioned  into  4  parts 
with  a  roughly  equal  number  of  nodes  in  each  sub-circuit.  A  transistor  is 
viewed  as  a  nonlinear  three  terminal  device  in  the  circuit.  Thus,  apply¬ 
ing  the  Ebers-Moll  transistor  model  (see  Ebers  and  Moll  [1954]),  24,  27, 

23  and  27  KCL  functions  are  defined  in  the  1st,  2nd,  3rd  and  4th  block 
respectively.  The  7  connections  among  the  4  blocks  result  in  14  linear  cur¬ 
rent  and  voltage  connecting  functions.  The  total  number  of  variables  is 

24  +  27  +  23  +  27  +  14  =  115.  The  structure  of  the  block  bordered 
Jacobian  matrix  and  the  linear  connecting  border  is  shown  in  Appendix  B. 

We  also  applied  our  algorithm  to  a  large  analog  filter,  shown  in  Appendix 
C,  that  is  composed  of  three  741  op-amp  circuits  (see  e.g.  Smith  [1971], 
Valkenburg  [1982]).  This  circuit  leads  to  a  2-level  block-bordered  nonlinear 
system,  as  follows.  The  analog  filter  is  first  partitioned  into  3  parts,  each 
of  which  contains  one  741  op-amp  circuit.  The  first  level  block  bordered 
structure  is  thus  formed  with  3  diagonal  blocks  and  1  connecting  block.  Each 
of  the  diagonal  blocks  is  a  741  op-amp  circuit  which  is  partitioned  into  the 
second  level  block  bordered  structure. 
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5.2  The  741  op-amp  Circuit  Simulation  on  the  Intel 
Hypercube 

The  nonlinear  block  bordered  equations  of  the  741  op-amp  circuit  were  solved 
in  parallel  on  an  Intel  iPSCl  hypercube  using  the  algorithm  of  Section  4.2. 
The  4  blocks  of  the  circuit  were  distributed  among  4  nodes  of  the  hypercube. 
For  convenience,  the  steps  involving  the  connection  function  fq+x  (steps  3-5 
of  the  parallel  algorithm)  were  performed  on  a  different  node  which  plays 
the  control  role.  They  could  just  as  well  have  been  done  on  one  of  the 
4  nodes.  Identical  initial  values  were  used  as  the  inputs  for  all  the  above 
experiments,  and  the  convergence  tolerances  were  also  the  same  for  those 
experiments.  The  solutions  of  the  experiments  were  verified  by  comparing 
them  to  the  solutions  computed  by  the  program  SPICE  which  is  a  general- 
purpose  circuit  simulation  program  for  nonlinear  dc,  nonlinear  transient,  and 
linear  ac  analysis  (see  Newton,  Pederson  and  Sangiovanni-Vincentelli  [1988]). 

Tables  5.1  and  5.2  show  the  experimental  results  for  the  explicit  method. 
Tables  5.3  and  5.4  list  the  experimental  results  for  the  corrected  implicit 
method  with  one  or  more  than  one  inner  iterations  per  outer  iteration  and 
with  inner  line  searches.  Note  that  as  long  as  the  inner  line  search  is  ap¬ 
plied,  the  corrected  implicit  method,  even  with  one  inner  iteration  per  outer 
iteration,  is  not  the  same  as  the  explicit  method. 

In  Tables  5.1  and  5.3,  T,  (i  =  1,  ...,  4)  is  the  total  computing  time 

for  all  computations  involving  the  ith  diagonal  block  on  node  i ,  Jj,  is  the 
total  computing  time  for  all  computations  involving  the  bottom  block  on  the 
control  node,  Tc  is  the  total  communication  time  for  the  computation,  Nout  is 
the  total  number  of  outer  iterations  required  to  converge  to  the  solution,  and 

in  Table  5.3  is  the  number  of  inner  iterations  used  in  the  corrected  implicit 


Table  5.1:  Times  for  the  Explicit  Method  on  the  op-amp  741  Circuit 


■1 

mm 

mm 

mm 

Tb 

mm 

N0u1 

12.81 

14.20 

13.45 

14.58 

3.42 

0.43 

20 

20 


Table  5.2:  Parallel  Performance  of  the  Explicit  Method  on  the  op-amp  741 
Circuit 


Ts 

TP 

sp 

eff 

58.46 

18.43 

3.14 

78.5% 

Table  5.3:  Times  for  the  Corrected  Implicit  Method  on  the  op-amp  741 
Circuit 


/,„ 

T\ 

t2 

t3 

t4 

Tb 

Tc 

A  out 

1 

11.81 

13.06 

11.96 

13.28 

2.84 

0.38 

18 

2 

12.60 

14.01 

12.83 

14.45 

1.65 

0.32 

15 

3 

11.28 

12.84 

11.46 

13.01 

1.38 

0.25 

12 

4 

18.48 

20.57 

18.81 

21.34 

1.32 

0.25 

11 

5 

29.04 

32.12 

29.92 

32.89 

— 

1.34 

0.24 

11 

Table  5.4:  Parallel  Performance  of  the  Corrected  Implicit  Method  on  the 
op-amp  741  Circuit 


m 

T. 

mm 

sp 

eff 

i 

52.95 

16.5 

3.21 

80.25% 

2 

55.54 

16.42 

3.38 

84.50% 

3 

49.97 

14.64 

3.43 

85.75% 

4 

80.52 

22.91 

3.50 

87.50% 

5 

125.31 

34.47 

3.60 

90.0% 
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method.  In  the  performance  Tables  5.2  and  5.4,  T,  is  the  total  computing 
time  for  solving  the  same  problem  sequentially  on  one  node, 

T,  =  j^Ti  +  Jl, 

i=i 

Tp  is  the  parallel  computing  time, 

Tp  —  max{T\ ,  TI4)  +  Tj,  4-  Tc, 


sp  is  the  speedup  of  the  parallel  computation  in  comparison  to  its  sequential 
counterpart, 

T. 


sp  =  =r, 

Jp 

and  eff  is  the  parallel  efficiency  defined  by 

sp 


eff  = 


number  of  processors 


Our  experiments  show  that  using  the  implicit  method,  with  inner  line 
search  and  multiple  inner  iterations  per  outer  iteration,  does  indeed  speed 
up  the  convergence  to  the  solution.  For  example,  the  implicit  method  with 
one  inner  iteration  per  outer  iteration  and  inner  line  search  used  18  iterations 
to  converge  to  the  solution.  The  same  algorithm  without  inner  line  search,  i.e. 
the  explicit  method,  used  20  iterations.  As  the  number  of  inner  iterations  per 
outer  iteration,  was  increased,  the  total  number  of  outer  iterations 
decreased  from  18  to  11,  and  the  speedup  increased  because  the  bottom  block 
computations  constituted  a  smaller  percentage  of  the  overall  computation. 
However,  the  sequential  computing  time  only  decreased  by  a  small  amount 
until  I in  =  3,  and  then  increased  dramatically  because  the  cost  of  the  extra 
inner  iterations  swamped  the  small  savings  from  the  further  decrease  in  the 
number  of  outer  iterations.  Both  the  sequential  and  parallel  computing  times 
were  minimized  when  the  number  of  inner  iterations  was  =  3,  and  the 
speedup  in  this  case,  3.43,  was  good.  The  parallel  method  for  this  case  costs 
21%  less  than  the  parallel  explicit  method. 

Our  experiments  also  show  that  the  bottle-neck  computing  time  Tj,  of 
the  corrected  implicit  method  is  more  than  50%  lower  than  in  the  explicit 
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method  if  more  than  one  inner  iteration  is  applied.  The  communication  time 
is  also  lower  since  the  total  number  of  iterations,  and  hence  the  time  to  send 
the  updated  variables  among  the  nodes,  is  less  than  for  the  explicit  method. 
Consequently,  the  advantage  of  the  implicit  method  over  the  explicit  method 
should  be  greater  in  larger  problems  where  the  amount  of  communication 
and  cost  of  solving  the  bottom  block  are  larger. 


5.3  Experiments  for  2-level  Block  Bordered  Circuit 
Equations 

The  2-level  nonlinear  block  bordered  equations  for  the  analog  filter  composed 
of  3  blocks  of  the  741  op-amp  circuit  were  also  solved  in  parallel  on  the  Intel 
iPSCl  hypercube  multiprocessor.  The  linearization  of  these  equations  at 
each  iteration  has  the  form 

JAX  =  -F  (5.3) 

where  J  is  the  2-level  block  bordered  matrix  shown  in  Appendix  D, 

AX  (Axj,  ...,  Ai?,  Ax?+1,  Axj,  ...,  Ax?,  Ai?^j,  Axj,  ...,  Ai?, 
Ax^+1,  Ai?+1)T 


and 


f  =  (A1, /},  y}+1,  fl 


f 


f2 

J  9  +  1  * 


f3 
j  i  > 


/3. 

Jq 


f3 

jq  + 1> 


fq+ 1)1 


The  system  (5.3)  could  be  solved  by  applying  the  block  bordered  solver 
to  each  of  the  3  block  bordered  submatrices,  and  then  solving  the  whole 
system  by  applying  the  block  bordered  solver  again.  Alternatively,  the  block 
bordered  Jacobian  matrix  may  be  reordered  to 
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Mi 


4 


A\ 


Cl  .  C] 


Cl 


Cl 

and  AX  reordered  to 


Al 


A\ 


cl 


cl 


B\ 


B] 


B l 


Bl 


Bi 


B2 


Bl 


:  Cl 


P 1 


.  B3 

Bl 


P 2 


P3 


c3 


p  J 


AX  =  (Ax},  Ax],  A x\,  \x },  Ail,  ...,  Ax], 
Axv+J,  Ax„j.j.  Axq+1,  Aij+j)7" 

and  F  reordered  to 


f  =  (fl ....  yj,  y?,  ....  n 


r3 

' 


/3 

J 1 


fl 


9+1  ’ 


f 2 

-'g+H 


Z-3 

y9+n 


fq+i)1 


For  solving  this  block  bordered  system,  2  levels  of  parallelism  can  be  ex¬ 
ploited.  Let  m  be  the  number  of  amplifiers  in  the  analog  filter  and  q  the 
number  of  sub-circuits  inside  each  amplifier;  here  m  =  3,  5  =  4.  First  the 
m  x  q  independent  operations  for  solving  the  diagonal  blocks  can  be  per¬ 
formed  in  parallel.  Secondly,  the  m  independent  operations  for  transforming 
the  matrices  P-7,  j  =  1,  ...,  m,  and  solving  the  resultant  systems  of  equa¬ 
tions  can  be  performed  in  parallel.  Finally,  the  very  bottom  block,  with  the 
matrix  P,  must  be  transformed  and  solved.  This  is  the  approach  that  we 
took. 
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Table  5.5:  Times  for  the  Explicit  Method  on  the  Analog  Filter 


wm 

mm 

mm 

wsm 

mm 

mm 

warn 

mm 

■i 

13.49 

14.93 

14.21 

15.34 

13.44 

14.85 

14.16 

15.46 

13.36 

mm 

mm 

T 1 

T 2 

T3 

Tb 

mm 

iVou, 

14.76 

14.25 

15.63 

3.01 

3.12 

3.17 

0.58 

1.31 

21 

Table  5.6:  Parallel  Performance  of  the  Explicit  Method  on  the  Analog  Filter 


T , 

mm 

sp 

e// 

183.76 

20.69 

8.88 

74.00% 

In  our  test  program,  the  12  diagonal  block  equations  of  the  analog  filter 
were  distributed  among  12  nodes  of  the  Intel  hypercube.  The  first  level  or 
internal  connection  functions  in  each  amplifier,  /^+1  (j  =  1,  3)  were 

distributed  to  3  of  the  12  nodes,  and  the  second  level  connection  function 
among  the  3  amplifiers  in  the  analog  filter,  /,+i,  was  handled  sequentially  by 
one  of  the  12  nodes. 

Tables  5.5  and  5.6  give  the  experimental  results  and  the  performance  of 
the  explicit  method  for  solving  the  2-level  analog  filter  equation.  Tables  5.7- 
5.11  show  the  experimental  results  and  performance  of  the  corrected  implicit 
method  for  solving  the  2-level  block  bordered  analog  filter  equations  with  one 
to  four  inner  iterations.  The  symbols  in  the  tables  have  the  same  meanings 
as  in  Tables  5. 1-5. 4,  with  the  following  exception:  71,-,  i  =  1,  ...,  12  is  the 
total  time  for  the  12  first  level  blocks,  while  T\  j  =  1,2,3  is  the  time  for 
the  3  second  level  blocks. 

Our  experimented  results  show  that  the  corrected  implicit  method  is  also 
more  efficient  than  the  explicit  method  on  this  larger  block  bordered  sys¬ 
tem  of  equations.  The  total  number  of  iterations  Nout  is  21  for  the  explicit 
method,  while  for  the  implicit  method  it  decreeises  from  18  to  11  as  the  num- 
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Table  5.7:  Corrected  Implicit  Method  Times  for  the  Analog  Filter,  /,„  = 


mm 

mm 

mm 

mm 

mm 

mm 

mm 

mm 

m 

11.54 

12.78 

12.13 

13.13 

11.52 

12.69 

12.10 

13.21 

11.42 

TOM 

mm 

mm 

T 2 

T 3 

Tk 

mm 

N out 

12.61 

12.20 

13.40 

2.56 

2.67 

2.73 

0.5 

1.12 

18 

Table  5.8:  Corrected  Implicit  Method  Times  for  the  Analog  Filter,  =  2 


mm 

mm 

mm 

r4 

mm 

mm 

mm 

mm 

■■ 

11.83 

11.93 

11.84 

12.09 

11.91 

12.01 

11.89 

12.12 

11.84 

TOM 

Tn 

TOM 

Tl 

T2 

msm 

Th 

mm 

Nout 

12.05 

11.94 

12.13 

1.65 

1.65 

1.64 

0.36 

0.73 

12 

Table  5.9:  Corrected  Implicit  Method  Times  for  the  Analog  Filter,  /,„  =  3 


Tx 

mm 

mm 

wm 

mm 

mm 

mm 

Ts 

t9 

21.25 

23.67 

23.25 

24.51 

21.67 

23.29 

24.21 

24.35 

21.22 

TOM 

mm 

TOM 

T 1 

T2 

T3 

mm 

mm 

A 

23.56 

23.21 

24.75 

1.52 

1.51 

1.53 

0.34 

0.69 

11 

Table  5.10:  Corrected  Implicit  Method  Times  for  the  Analog  Filter,  =  4 


r, 

mm 

mm 

mm 

mm 

mm 

mm 

mm 

El 

29.03 

32.24 

29.98 

32.75 

29.11 

32.04 

29.87 

32.71 

29.40 

TOM 

K9 

ESI 

T1 

T2 

T3 

■ai 

N  out 

32.21 

31.05 

32.34 

1.50 

1.51 

1.50 

0.35 

0.69 

11 

Table  5.11:  Parallel  Performance  of  the  Corrected  Implicit  Method  on  the 
Analog  Filter 


T, 

mm 

sp 

eff 

i 

157.19 

17.19 

8.86 

73.83% 

2 

148.68 

14.86 

10.01 

83.40% 

3 

283.84 

27.31 

10.39 

86.58% 

4 

373.08 

34.89 

10.69 

89.08% 

ber  of  inner  iterations  is  increased  from  1  to  4.  However  the  sequential 
computing  time  Tt,  for  the  implicit  method  only  decreases  from  157.19  for 
Iin  =  1  to  148.68  for  /,„  =  2  (for  the  explicit  method  it  is  183.76),  then  it 
increases  sharply  due  to  the  cost  of  the  additional  inner  iterations.  Thus  the 
high  speedups  for  the  implicit  method  for  =  3  and  4  in  comparison  to  the 
same  sequential  method  are  not  significant  since  the  large  number  of  inner 
iterations  makes  the  algorithm  inefficient,  and  the  sequential  time  is  subop- 
timal.  For  the  optimal  number  of  inner  iterations,  Un  =  2,  the  speedup  is 
10.01  for  12  processors  and  the  efficiency  is  83.40%.  The  parallel  computa¬ 
tion  time  improvement  over  the  parallel  explicit  method  is  28%,  as  compared 
to  19%  in  the  sequential  case.  We  feel  that  this  experiment  indicates  that  ap¬ 
plying  the  implicit  method  to  solve  large  block  bordered  circuit  equations  on 
a  distributed  memory  multiprocessor  can  result  in  a  highly  efficient  method. 


6  Summary  and  Future  Research 


We  have  introduced  a  corrected  implicit  method  for  solving  block  bordered 
systems  of  nonlinear  equations.  It  allows  multiple  “inner”  iterations,  itera¬ 
tions  on  the  variables  and  equations  of  the  q  diagonal  blocks,  to  be  performed 
per  each  “outer”  iteration,  which  involves  all  the  variables  and  equations 
including  the  connecting,  bottom  block.  If  only  one  inner  iteration  is  per¬ 
formed  per  outer  iteration,  no  line  search  is  used,  and  the  bottom,  connecting 
equations  are  linear,  then  the  corrected  implicit  method  is  identical  to  the 


explicit  method  (Newton’s  method).  When  more  than  one  inner  iteration  is 
performed  per  outer  iteration,  however,  the  methods  are  different,  and  in  our 
experiments  the  corrected  implicit  method  solves  problems  in  somewhat  less 
time  than  the  explicit  method  on  sequential  computers.  On  parallel  comput¬ 
ers,  the  corrected  implicit  method  has  a  larger  advantage  over  the  explicit 
method  because  it  parallelizes  more  effectively,  since  the  inner  iterations  con¬ 
stitute  a  larger  percentage  of  the  total  computation  and  parallelize  far  more 
effectively  than  the  outer  iterations.  On  one  and  two  level  block  bordered 
problems  from  VLSI  circuit  design  that  we  tested,  the  parallel  efficiency  of 
the  fastest  (sequential  and  parallel)  corrected  implicit  method  on  an  Intel 
iPSCl  hypercube  was  about  85%. 

The  methods  presented  in  this  paper  all  assume  that  the  Jacobian  ma¬ 
trix  is  available  at  each  iteration,  either  analytically  or  by  finite  differences, 
and  that  it  isn’t  too  expensive  to  evaluate.  In  some  applications,  however, 
the  nonlinear  equations  are  given  by  an  expensive  computational  procedure, 
and  analytic  or  finite  difference  Jacobians  are  very  expensive  to  obtain.  In 
such  cases  for  general  systems  of  nonlinear  equations,  secant  approximations 
to  the  Jacobian  are  used  that  are  based  entirely  on  function  values  at  the 
iterates  (see  e.g.  Dennis  and  Schnabel  [1983]).  The  development  of  related 
secant  approximations  to  the  Jacobian  for  block  bordered  nonlinear  equa¬ 
tions  seems  to  be  an  attractive  research  topic,  since  it  appears  possible  to 
construct  approximations  that  retain  the  block  bordered  sparsity  pattern  of 
the  Jacobian  and  also  allow  the  factorization  of  the  Jacobian  approximation 
to  be  updated  efficiently. 
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APPENDIX  A 


THE  PARTITIONED  741  OP-AMP  CIRCUIT 

The  741  op-amp  circuit  was  introduced  in  1966  and  is  currently  pro¬ 
duced  by  almost  every  analog  semiconductor  manufacturer.  The  circuit  is 
partitioned  into  4  parts  with  roughly  equal  elements  in  each  sub-circuit. 
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APPENDIX  B 

THE  BLOCK  BORDERED  JACOBIAN  MATRIX 
OF  THE  PARTITIONED  741  OP-AMP  CIRCUIT 


27*27 
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APPENDIX  C 


AN  ANALOG  FILTER 
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